4  Multiple Linear Regression (MLR)

4.1 Introduction

Multiple linear regression (MLR) extends simple linear regression by allowing \(Y\) to depend on more than one predictor variable. This enables richer models and allows us to estimate the partial contribution of each predictor while accounting for others.

When to Extend SLR

SLR is limited to one predictor. MLR becomes appropriate when:

  • Multiple factors plausibly influence the response.
  • Excluding predictors may bias results.
  • You wish to measure the unique contribution of each predictor while controlling for others.

A graphical motivation

Lets consider a simple example using the trees data

This dataset records the height, girth (diameter) and volume

We might look at each bivariate (i.e. two variables) relationship separately:

However we might also look at the multivariate relationship

in the case of 3 variables we can also extend our visualisation to 3 dimensions:

4.2 The multiple regression linear model

The multiple linear model extends the simple linear model from Section 2.4 straightforwardly by adding the extra variables to the deterministic linear predictor, each with its own parameter.  

\[ Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon, \quad \varepsilon \sim N(0,\sigma^2) \]

Now, instead of a single predictor \(X\), we may have several (\(k\)) predictors \(x_1, x_2, \ldots, x_k\), each corresponding to a column in our dataset. We use an index to distinguish these predictors, and each predictor \(x_i\) has a corresponding parameter \(\beta_i\) governing its effect on the response. Note that we have updated the intercept parameter \(\alpha\) from the simple linear regression section to \(\beta_0\), because it is simply another parameter in the model!

The random error term, \(\varepsilon\) stays the same, so the assumptions of MLR are very similar to those of SLR:

Assumption: Assumptions of the MLR model
  • A linear predictor, e.g. \(E[Y]=\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\).
  • Independent and identically distributed random error terms that
    • Have a mean \(\mu=0\), and a constant variance \(Var(\varepsilon)=\sigma^2\)
    • Follow a normal distribution: \(N(0,\sigma^2)\)
Note: Multivariate Linearity

Although this is still a linear model, the term “linear” no longer refers to a literal straight line in two dimensions as it did in simple linear regression. Instead, “linear” means that the model is linear in its parameters: each \(\beta_i\) multiplies a predictor and enters additively. This linear model may live in a high-dimensional predictor space and cannot be visualised as a single line. For example, for the three dimensional data presented above (one outcome: \(Y\) + two predictors: \(X_1\), \(X_2\)), a linear model will instead look like a three dimensional plane

Partial Regression Coefficients

Even though our model is no longer just a straight line, the slope interpretation from simple linear regression is still relevant in multiple regression. Now however, \(\beta_i\) describes the expected change in the mean response for a one-unit increase in \(x_i\), holding all other predictors constant.

Example 1

Interpreting a partial coefficient If the fitted model for life expectancy includes Internet usage and BirthRate,

Internet: 0.112
BirthRate: -0.594

Then:

  • A 1% increase in internet usage is associated with an expected 0.11-year increase in life expectancy, holding other predictors fixed.
  • A one‑unit increase in birth rate corresponds to an expected 0.59‑year decrease in life expectancy, controlling for all other predictors.
Exercise 1: Calculating the expected value of a multiple regression model

4.3 Fitting MLR Models in R and the tidy() function

Fitting a multiple linear regression uses the exact same conceptual process from ?sec-leastsquares: Residuals are still defined \(e = Y- \hat{Y}\), and minimising the sum of squared residuals provides optimal and unique ‘least squares estimates’ for the model parameters \(b_0, \ldots, b_k\).

MLR uses the same lm() function as SLR (these are both ‘linear models’, after all). Now, the model formula (the first argument, identified by the ~ separating the LHS and RHS) includes both predictors on the RHS, separated by a +:

Note: R formulas

and we can extract our model coefficients (\(\beta_0, \ldots,\beta_3\)) in the same way:

mod$coefficients
(Intercept)          x1          x2 
  1.8184485   0.5587775   1.4934882 
coef(mod)
(Intercept)          x1          x2 
  1.8184485   0.5587775   1.4934882 

We can also use the tidy function from the broom package for a nice summary of the relevant inferential statistics for each parameter:

# A tibble: 3 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    1.82     0.388       4.68 5.21e- 6    1.05      2.58 
2 x1             0.559    0.0517     10.8  1.10e-21    0.457     0.661
3 x2             1.49     0.0483     30.9  1.48e-77    1.40      1.59 

The hypotheses being tested for each coefficient is analagous to that in the SLR case ?sec-beta_inference: \[ H_0 : \beta_i = 0 \qquad\text{vs}\qquad H_a : \beta_i \ne 0. \] And is tested the same way - using the t-statistic: \[ t_i = \frac{\hat\beta_i}{\operatorname{SE}(\hat\beta_i)}. \]

However, the interpretation of these hypotheses is slightly more nuanced. As pointed out above, the value of _i here does not represent the relationship between \(Y\) and \(X_i\) without qualification - but only when the other variables are held constant. Analogously, the hypothesis being tested by \(t_i\) is whether \(X_i\) is related to to \(Y\) once the other predictors are accounted for. ::: Key-point ### interpreting partial regression coefficients, \(\beta_i\)

Because of this:

Warning: Multicollinearity

4.4 Global model statistics: The F-Test and measures of model fit

Having multiple predictor variables expands the scope of the kind of questions we can ask about our linear model. Rather than looking at each coefficient separately, we can ask whether our model as a whole is effective in explaining the outcome \(Y\) In other words, is the combined contribution of the predictors enough to conclude that a linear relationship exists?

Formally, the global hypothesis test is:

This shift parallels the move from multiple t-tests to an ANOVA - instead of testing individual “effects,” we examine the overall variance explained by a model.

The F-statistic compares:

Because each of these quantities is constructed from squared normal deviations, the ratio

\[ F = \frac{\text{MSR}}{\text{MSE}} \]

follows an F-distribution under \(H_0\).
If the model truly explains some structure in the data, MSR will be noticeably larger than MSE.

[#^1] The links between ANOVA and linear regression will be futher explored here in ?sec-qualitative_predictors.

Example 2
In the life expectancy example, the output might report

F-statistic: 28.2 on 4 and 44 DF,  p-value: 1.19e-11

The very small p-value indicates strong evidence against \(H_0\). We conclude that at least one of the predictors (Population, Health, Internet, BirthRate) is useful for predicting life expectancy, and that the model is useful overall.

Exercise 2
Given an F-statistic of 12.3 with p = 0.0004, state the null and alternative hypotheses and conclude whether the model is useful at the 5% level.

global statistics with glance()

We can obtain the model F-statistic using the glance() function from the broom package

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.841         0.839  2.00      521. 2.24e-79     2  -420.  849.  862.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

This output includes several global model features besides the global test statistic and p-value.

Adjusted \(R^2\)

Once we have established that the model is useful overall, we can quantify how much of the variation in \(Y\) it explains. The measure \(R^2\) was introduced in SLR and extends naturally to MLR In multiple regression, \(R^2\) plays the same descriptive role, but its interpretation changes subtly because the model now includes several predictors.

\(R^2\) in Multiple Regression

We define \[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}}, \] exactly as before. The difference lies in what \(R^2\) represents:

Because adding a new predictor can never increase SSE, it follows that \(R^2\) can never decrease when more predictors are added, even if the new predictor has little or no real relationship with the response. For this reason, \(R^2\) is not reliable for comparing models with different numbers of predictors.

Adjusted \(R^2\)

To address the fact that \(R^2\) is overly optimistic in larger models, we use the adjusted coefficient of determination: \[ R^2_{\text{adj}} = 1 -\frac{\text{SSE}/(n - k - 1)}{\text{SST}/(n - 1)}. \]

Adjusted \(R^2\):

Thus,

This completes our introduction to multiple linear regression: we now have a model with several predictors, an interpretation for its coefficients, and tools to judge whether the model is useful and how well it fits the data.

Other model fit statistics: logLikelihood, AIC, and BIC

The glance() output also includes two additional quantities: AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion).
Although they appear alongside the F-statistic and \(R^2\), they serve a different purpose.

AIC and BIC are not measures of how well a single model fits the data in an absolute sense.
Instead, they are designed for comparing multiple competing models, balancing goodness of fit against model complexity. Lower values indicate a preferable trade-off, but only relative comparisons are meaningful.

Because AIC and BIC are tools for model selection rather than model assessment, we do not interpret them here. They will be discussed in detail in Module 5 (Principles of Model Building), where they are used to guide decisions about which predictors to include in a regression model.